Exploring Image Data

Luis Garduno

1. Business Understanding

       About STL-10

Inspired by the CIFAR-10 dataset, STL-10 is a dataset of images (gathered from ImageNet) of animals and transportation objects. The dataset has 10 classes: 6 animal classes (bird, cat, deer, dog, horse, monkey) and 4 transportation classes (airplane, car, ship, truck).

The dataset contains 3 folders (train, test, and unlabeled) that will be used at specified times.

Aside from not having identical classes, another difference between the two datasets is that STL-10 images have 3 times the resolution of CIFAR-10 images along each side (96x96 versus 32x32).

STL-10 is specifically an image recognition dataset, intended for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. The primary prediction task is therefore to determine the type of animal or transportation object found in each picture in the Unlabeled folder. Note that the Unlabeled folder, in addition to containing the classes mentioned above, also includes other types of animals (bears, rabbits, etc.) and transportation objects (trains, buses, etc.).

       Measure of Success

One reason this data is important is that, if a model is trained correctly and the prediction task is achieved, third parties that use image CAPTCHAs on their websites, networks, etc. could use the results to see how CAPTCHAs can be bypassed by unsupervised feature learning, which would defeat the purpose of having a CAPTCHA test.

For this work to be useful to third parties that use CAPTCHAs, I believe the prediction algorithm will have to reach at least 80% accuracy. The threshold is not 90% because CAPTCHA tests will often tolerate about 2 or fewer errors, so the algorithm can select a wrong image, or fail to recognize one, and still pass.


Dataset : STL-10 Kaggle Dataset

Question Of Interest : Identify the type of animal or transportation object shown in the picture


2. Data Preparation

       2.1 Loading Data & Adjustments


We begin by reading the datasets (train & test) into numpy arrays. Because they contain color images, we convert these arrays to grayscale so that later computations run faster.

After doing so, the shapes of the original and grayscale matrices are printed to show the initial dimensions. To the right of these shapes are the shapes of the concatenated versions of those matrices.

The output of cell 10 is created to better understand the differences between the 4 matrices: two containing the original color pictures and two containing the grayscale pictures. Here we notice the large difference between the image sizes in each matrix: each grayscale picture is one third the size of the original color picture, since it stores one value per pixel instead of three.
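
The grayscale conversion described above can be sketched as follows. This is a minimal illustration, not the notebook's original cell: the helper name `to_grayscale`, the luma weights (the common ITU-R BT.601 values), and the synthetic random batch standing in for real STL-10 images are all assumptions made for the example.

```python
import numpy as np

def to_grayscale(images):
    """Convert an (n, h, w, 3) RGB array into an (n, h, w) grayscale array."""
    # Assumed weights: ITU-R BT.601 luma coefficients. A plain mean over the
    # channel axis would be another reasonable choice.
    weights = np.array([0.299, 0.587, 0.114])
    # Matrix-multiplying by a length-3 vector collapses the channel axis.
    return images.astype(np.float64) @ weights

# Synthetic stand-in for a small batch of 96x96 color images (not real data).
rng = np.random.default_rng(0)
color_batch = rng.integers(0, 256, size=(4, 96, 96, 3), dtype=np.uint8)
gray_batch = to_grayscale(color_batch)

print(color_batch.shape)  # (4, 96, 96, 3)
print(gray_batch.shape)   # (4, 96, 96)
```

Dropping the channel axis is what makes later steps cheaper: every image carries one value per pixel instead of three.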
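
The size difference can be checked with a short sketch. It assumes the concatenated matrices hold one flattened image per row, and it uses zero-filled placeholder arrays with an arbitrary image count (`n_images`) rather than the real train and test sets.

```python
import numpy as np

n_images, h, w = 10, 96, 96  # n_images is an arbitrary illustrative count

# Placeholder arrays with the same layout as color and grayscale image stacks.
color = np.zeros((n_images, h, w, 3), dtype=np.uint8)
gray = np.zeros((n_images, h, w), dtype=np.uint8)

# Flatten each image into one row so that every pixel value becomes a feature.
color_flat = color.reshape(n_images, -1)
gray_flat = gray.reshape(n_images, -1)

print(color_flat.shape)  # (10, 27648)
print(gray_flat.shape)   # (10, 9216)
print(color_flat.shape[1] // gray_flat.shape[1])  # 3
```

Each color row has 96 x 96 x 3 = 27648 features, while each grayscale row has 96 x 96 = 9216, which is where the factor of 3 comes from.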

       2.2 Visualizing Images

Here we visualize 18 images from the grayscale numpy array. This function will be helpful later on for displaying selected images from a given array.
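
A gallery function of this kind might look like the sketch below. The name `plot_gallery`, its parameters, and the 3 x 6 grid are assumptions for illustration, and random noise stands in for the grayscale images.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

def plot_gallery(images, h, w, n_row=3, n_col=6, titles=None):
    """Draw the first n_row * n_col rows of `images` as h x w grayscale tiles."""
    fig, axes = plt.subplots(n_row, n_col, squeeze=False,
                             figsize=(1.7 * n_col, 2.0 * n_row))
    for i, ax in enumerate(axes.ravel()):
        ax.axis("off")
        if i >= len(images):
            continue  # leave unused tiles blank
        # Each row is a flattened image, so restore its 2-D shape before drawing.
        ax.imshow(images[i].reshape(h, w), cmap="gray")
        if titles is not None:
            ax.set_title(titles[i], fontsize=8)
    fig.tight_layout()
    return fig

# Random noise standing in for 18 flattened 96x96 grayscale images.
rng = np.random.default_rng(0)
demo_images = rng.random((18, 96 * 96))
fig = plot_gallery(demo_images, 96, 96)
```

Returning the figure rather than calling `plt.show()` inside the function lets the same helper be reused for any array of flattened images, for example reconstructions after dimensionality reduction.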


3. Data Reduction

       3.1 Linear Dimensionality Reduction

             3.1.1 PCA
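
As a rough sketch of what full PCA does on flattened image rows, the example below computes the principal components with a singular value decomposition in plain numpy. The function name `pca` and the random input matrix are assumptions for illustration; the notebook may well use a library implementation instead.

```python
import numpy as np

def pca(X, n_components):
    """Return the top principal axes, their explained variance ratios, and the mean."""
    mean = X.mean(axis=0)
    Xc = X - mean  # PCA operates on mean-centered data
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = S ** 2 / (X.shape[0] - 1)
    ratio = variances[:n_components] / variances.sum()
    return Vt[:n_components], ratio, mean

# Random matrix standing in for 50 flattened images with 64 pixels each.
rng = np.random.default_rng(0)
X = rng.random((50, 64))

components, ratio, mean = pca(X, n_components=10)
projected = (X - mean) @ components.T          # 64 features reduced to 10
reconstructed = projected @ components + mean  # back in pixel space

print(projected.shape)      # (50, 10)
print(reconstructed.shape)  # (50, 64)
```

The cumulative sum of `ratio` is the usual guide for deciding how many components are enough to represent the images.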

             3.1.2 Randomized PCA
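
Randomized PCA approximates the same decomposition by first projecting the data onto a small random subspace. The sketch below follows the general randomized range-finder idea in plain numpy; the function name `randomized_pca`, the oversampling and iteration defaults, and the synthetic low-rank data are assumptions for illustration, not the notebook's actual settings.

```python
import numpy as np

def randomized_pca(X, n_components, n_oversamples=10, n_iter=2, seed=0):
    """Approximate the top principal axes and singular values of X."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # Sketch size: a few extra directions improve the approximation.
    size = min(n_components + n_oversamples, min(Xc.shape))
    # Random projection that captures the dominant column space of Xc.
    Y = Xc @ rng.standard_normal((Xc.shape[1], size))
    for _ in range(n_iter):
        # Power iterations sharpen the subspace; QR keeps them stable.
        Y, _ = np.linalg.qr(Y)
        Y = Xc @ (Xc.T @ Y)
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the much smaller projected matrix.
    _, S, Vt = np.linalg.svd(Q.T @ Xc, full_matrices=False)
    return Vt[:n_components], S[:n_components]

# Synthetic data of rank 5 standing in for flattened images.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 300))

components, singular_values = randomized_pca(X, n_components=5)
print(components.shape)  # (5, 300)
```

The expensive decomposition is only applied to a matrix with `size` rows, which is why this variant tends to be faster when few components are needed from wide image matrices.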

             3.1.3 PCA vs Randomized PCA

       3.2 Feature Extraction

             3.2.1 Image Gradients
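
A minimal sketch of an image gradient computation is shown below, using finite differences from `np.gradient`. The helper name `gradient_magnitude` and the synthetic ramp image are assumptions for illustration; the notebook may use a different operator (for example a Sobel filter).

```python
import numpy as np

def gradient_magnitude(image):
    """Return per-pixel gradient magnitude and orientation (radians) of a 2-D image."""
    # np.gradient returns derivatives along axis 0 (rows, y) and then axis 1 (columns, x).
    gy, gx = np.gradient(image.astype(np.float64))
    return np.hypot(gx, gy), np.arctan2(gy, gx)

# Synthetic 96x96 image whose brightness rises by 1 per pixel from left to right.
ramp = np.tile(np.arange(96, dtype=np.float64), (96, 1))
magnitude, orientation = gradient_magnitude(ramp)

print(magnitude.shape)  # (96, 96)
print(magnitude[0, 0])  # 1.0
```

On real images, large magnitudes mark edges, so the magnitude map serves as a simple feature that ignores flat regions.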

             3.2.2 Daisy Features


       3.3 Visualizing Feature Extraction efficiency

References

Kaggle. STL-10. https://www.kaggle.com/jessicali9530/stl10 (Accessed 9-25-2020)

Adam Coates, Honglak Lee, Andrew Y. Ng. An Analysis of Single Layer Networks in Unsupervised Feature Learning. AISTATS, 2011.